Abstract: We describe here our construction of lexical resources, tool creation, building of an
aligned  parallel  corpus,  and  an  approach  to  automatic  treebank  creation  that  we  have  been 
developing  using  Spanish  data,  based  on  projection  of  English  syntactic  dependency  information 
across  a  parallel  corpus. 


Introduction 

NLP  researchers  at  the  University  of  Maryland  are 
currently  working  on  the  construction  of  resources 
and  tools  for  several  multilingual  applications,  with 
a  focus  on  broad  coverage  machine  translation  (MT) 
and  cross-language  information  retrieval.  We 
describe  here  our  construction  of  lexical  resources, 
tool  creation,  building  of  an  aligned  parallel  corpus, 
and  an  approach  to  automatic  treebank  creation, 
which  we  have  been  developing  using  Spanish  data, 
based  on  projection  of  English  syntactic  dependency 
information  across  a  parallel  corpus. 

Creating  lexical  databases  for  Spanish 

We have built two types of lexical databases
for Spanish: one that is semantico-syntactic, based
on a representation called Lexical Conceptual
Structure (LCS), and one that is morphological,
based on Kimmo-style Spanish entries.

An LCS is a directed graph with a root that
reflects the semantics of a lexical item by a
combination of semantic structure and semantic
content. LCS representations are both language and
structure independent; they were originally
formulated by Jackendoff (1983, 1990) and have
been used as an interlingua in a number of machine
translation projects including UNITRAN and MILT
(Dorr 1993; Dorr 1997).

The creation of a Spanish LCS lexicon
relied heavily on the existence of a large
hand-generated database of English LCS entries, which
were ported over to Spanish LCS entries by means
of a bilingual lexicon and acquisition procedures as
described in Dorr (1997).

Our  Spanish  morphological  lexicon  was 
originally  derived  from  a  two-level  Kimmo-based 
morphology system (Dorr 1993). This lexicon
contains 273 roots and 99 types of endings, giving an
upper bound of 27,027 possible morphological
realizations (the number of roots multiplied by the
number of endings).

The  structure  of  the  lexicon  consists  of  (a) 
root  word  entries,  (b)  continuation  classes,  and  (c) 
endings. 

(DEF-MORPH-ROOT language root features
(string-root1 continuation-class1 features1)
(string-root2 continuation-class2 features2))

For example, the Spanish words ‘veo’ (‘I see’) and
‘visto’ (‘seen’) would have the following entries in
the lexicon:

(DEF-MORPH-ROOT Spanish VER [v]
(“ve” *ER-IRREG-6 NIL)
(“visto” NIL [perf-tns]))
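
The entry format above can be read as a small data structure. The following Python sketch is illustrative only: the class names, the helper `realizations`, and in particular the endings listed for `*ER-IRREG-6` are assumptions made for the example, since the text does not give the contents of any continuation class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# ASSUMPTION: a hypothetical fragment of one continuation class. The real
# inventory of endings in the Kimmo-style lexicon is not given in the text.
CONTINUATION_CLASSES: Dict[str, Dict[str, List[str]]] = {
    "*ER-IRREG-6": {
        "o": ["1sg", "pres"],
        "s": ["2sg", "pres"],
        "mos": ["1pl", "pres"],
    },
}

@dataclass
class RootEntry:
    string: str                        # root string or full irregular form
    continuation_class: Optional[str]  # None mirrors NIL: no ending attaches
    features: List[str] = field(default_factory=list)

@dataclass
class MorphRoot:
    language: str
    root: str
    features: List[str]
    entries: List[RootEntry]

def realizations(lexeme: MorphRoot) -> List[Tuple[str, List[str]]]:
    """Enumerate (surface form, features) pairs licensed by the entries."""
    forms = []
    for entry in lexeme.entries:
        base = lexeme.features + entry.features
        if entry.continuation_class is None:
            forms.append((entry.string, base))
            continue
        endings = CONTINUATION_CLASSES[entry.continuation_class]
        for ending, feats in endings.items():
            forms.append((entry.string + ending, base + feats))
    return forms

# The VER entry from the text, in this representation.
VER = MorphRoot("Spanish", "VER", ["v"], [
    RootEntry("ve", "*ER-IRREG-6"),
    RootEntry("visto", None, ["perf-tns"]),
])
```

Under these assumed endings, `realizations(VER)` yields ‘veo’, ‘ves’ and ‘vemos’ from the root “ve”, plus the stored form ‘visto’. Pairing every root with every ending in this way is what gives the upper bound of 273 x 99 = 27,027 realizations.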

We  used  this  lexicon  for  English-to-Spanish 
query  translation  in  several  cross-language 
information  retrieval  experiments.  The  results  were 
presented  at  the  First  International  Conference  on 
Language  Resource  Evaluation  (LREC)  in  Granada, 
Spain  (Dorr  and  Oard  1998). 

Applying  a  Spanish  LCS  Lexicon  in  MT 

We  have  experimented  with  an  interlingual  approach 
to  Spanish-English  machine  translation,  using  LCS 
representations  as  the  interlingua.  In  our  most  recent 
experiments  in  Spanish  to  English  translation,  we 
have  used  LCS  together  with  Abstract  Meaning 
Representations  (AMR)  as  developed  at  USC/ISI 
(Langkilde  and  Knight,  1998a).  AMRs  are  semantic- 
syntactic  language-specific  representations. 

After  parsing  the  Spanish  sentence,  we 
create  a  semantic  representation  (LCS),  which  is 
then  transformed  into  a  syntactic-semantic 
representation  of  the  target  language  sentence 
(AMR). This representation serves as the input to
Nitrogen, a generation tool developed by USC/ISI
(Langkilde and Knight, 1998a; Langkilde and
Knight, 1998b). Nitrogen is responsible for (a)
transforming the Spanish syntactic representation
into an English syntactic representation, (b) creating
a word lattice by generating all the possible surface
orderings (linearizations) for the English sentence,
(c) using an n-gram language model to choose the
optimal linearization, and finally (d) generating
morphological realizations, i.e. producing the
surface form of the English sentence that
corresponds to the translation of the original Spanish
sentence.
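
Step (c) can be illustrated with a toy Python sketch. It is not Nitrogen's implementation: it enumerates permutations by brute force rather than searching a compact lattice, and the function name, the bigram table and the `unseen` penalty are all assumptions made for the example.

```python
from itertools import permutations

def best_linearization(words, bigram_logprob, unseen=-10.0):
    """Return the word order with the highest bigram log-probability.

    bigram_logprob maps (previous word, next word) to a log-probability;
    bigrams missing from the table get the fixed penalty `unseen`.
    """
    def score(order):
        seq = ("<s>",) + order + ("</s>",)
        return sum(bigram_logprob.get(pair, unseen)
                   for pair in zip(seq, seq[1:]))
    return max(permutations(words), key=score)

# Hypothetical log-probabilities, invented for the example.
BIGRAMS = {
    ("<s>", "the"): -0.5,
    ("the", "black"): -1.0,
    ("the", "cat"): -1.2,
    ("black", "cat"): -0.7,
    ("cat", "</s>"): -0.9,
}
```

With this table, `best_linearization(["cat", "black", "the"], BIGRAMS)` selects the order "the black cat", because every competing order contains at least one unseen bigram.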

Acquiring  bilingual  dictionary  entries 

In addition to building and applying the more
sophisticated LCS lexical representations, we have
explored the automatic acquisition of simple
word-to-word correspondences from parallel corpora,
based on cross-language statistical association
between word co-occurrences. The noisy,
confidence-ranked bilingual lexicons obtained in this
way can be useful in porting LCS lexicons to new
languages, as described above, and are also useful by
themselves in improving dictionary-based
cross-language information retrieval (Resnik, Oard, and
Levow, 2001).
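
As a rough illustration of association-based acquisition, the Python sketch below ranks candidate word pairs from a sentence-aligned corpus. The text does not name the association statistic used, so the Dice coefficient here is an assumption, as are the function name and the threshold.

```python
from collections import Counter
from itertools import product

def dice_lexicon(sentence_pairs, min_score=0.5):
    """Rank (source, target) word pairs by Dice coefficient.

    sentence_pairs: iterable of (source sentence, target sentence) strings.
    Returns (score, source word, target word) tuples, best first.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_types, tgt_types = set(src.split()), set(tgt.split())
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        # Every source type co-occurs with every target type in the pair.
        pair_count.update(product(src_types, tgt_types))
    scored = []
    for (s, t), n in pair_count.items():
        score = 2 * n / (src_count[s] + tgt_count[t])
        if score >= min_score:
            scored.append((score, s, t))
    return sorted(scored, reverse=True)

# Toy corpus, invented for the example.
CORPUS = [
    ("the black cat", "el gato negro"),
    ("the cat", "el gato"),
    ("black dog", "perro negro"),
    ("the dog", "el perro"),
]
```

On this toy corpus, `dice_lexicon(CORPUS, min_score=0.9)` keeps only the pairs the/el, cat/gato, black/negro and dog/perro. On real data the ranking is noisy, which is why the resulting lexicon is used with confidence scores rather than as a clean dictionary.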

Constructing  an  Aligned  Corpus 

Parallel  corpora  have  emerged  as  a  crucial 
resource  for  acquiring  and  improving  lexical 
resources  such  as  bilingual  lexicons,  and  for 
developing  broad  coverage  machine  translation 
techniques.  We  have  therefore  devoted  effort  to 
acquiring  English-Spanish  parallel  text  using 
traditional  and  less  traditional  channels. 

Collecting  Parallel  Text 

We have obtained parallel data in three ways. First,
we have taken advantage of community-wide corpus
distribution channels, such as the Linguistic Data
Consortium (LDC), the European Language
Resource Distribution Agency (ELDA) and the
Foreign Broadcast Information Service (FBIS).
These sources provide data that are generally clean
and often aligned or easily alignable, and which
have the advantage of being available in common to
a large community of researchers.

Second,  we  have  collected  parallel  text  from 
the  World  Wide  Web  using  the  STRAND  system  for 
acquiring  parallel  Web  documents  (Resnik,  1999). 
(One such collection of Spanish-English documents
is available, as a set of URL pairs, at
http://umiacs.umd.edu/~resnik/strand.) Data
collected from the Web have the advantage of great
diversity  in  contrast  to  the  often  more  domain-  or 
genre-specific  forms  of  text  available  from  standard 
sources;  on  the  other  hand,  they  are  also  often  of 
extremely  diverse  quality. 

Third,  we  have  obtained  a  parallel  English- 
Spanish  version  of  the  Bible  as  part  of  our  general 
project  collecting  freely  available  Bible  versions  and 
annotating  their  parallel  structure  using  the  Corpus 
Encoding  Standard  (CES),  as  a  parallel  resource  for 
use in computational linguistics. Our empirical
studies of the Bible’s size and vocabulary coverage,
using LDOCE and the Brown Corpus for
comparison, suggest that modern-language Bibles
are a surprisingly viable source of information about
everyday language research (Resnik, Olsen, and
Diab, 1999). CES-annotated parallel English and
Spanish versions are available on the Web at
http://umiacs.umd.edu/~resnik/parallel/.

In  the  work  we  describe  here,  we  have  been 
focusing  our  development  on  the  Spanish-English 
United  Nations  Parallel  Corpus,  available  from 
LDC, which has data generated from 1989 through
1991. 

Aligning  the  Text  at  the  Sentence  Level 

The  U.N.  Parallel  Corpus  is  already  aligned  at  the 
document  level.  Our  alignment  of  the  corpus  at 
lower  levels  uses  a  combination  of  existing  tools  and 
components  we  have  constructed. 

As a first stage in below-document-level
alignment, we preprocess the text in order to obtain
alignments at the paragraph level using simple
document structure. HTML-style markup,
indicating a number of within-text boundaries above
the sentence level, is introduced automatically on the
basis of relevant cues in the text. The resulting
marked-up document is passed to a structure-based
alignment tool designed for use with HTML
documents (Resnik, 1999), which uses dynamic
programming (Unix diff) to generate an alignment
between text chunks on the basis of correspondences
in markup. Because only boundary markup is used,
not content, the process is entirely language
independent. Although the introduction of markup
is pattern-based and therefore somewhat heuristic, it
succeeds well at avoiding the introduction of
spurious (intra-sentential) boundaries.
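
The idea of aligning chunks purely on markup structure can be sketched in Python as follows. This is not the alignment tool cited above: Python's `difflib.SequenceMatcher` stands in for Unix diff, and the function names and the `[CHUNK]` placeholder are assumptions made for the example.

```python
import difflib
import re

TAG = re.compile(r"<[^>]+>")

def linearize(doc):
    """Turn a marked-up document into (token, text) items.

    Tags are kept as their own token; every text chunk gets the same
    token "[CHUNK]", so only structure, not content, drives alignment.
    """
    items, pos = [], 0
    for match in TAG.finditer(doc):
        text = doc[pos:match.start()].strip()
        if text:
            items.append(("[CHUNK]", text))
        items.append((match.group().lower(), None))
        pos = match.end()
    tail = doc[pos:].strip()
    if tail:
        items.append(("[CHUNK]", tail))
    return items

def align_chunks(doc_a, doc_b):
    """Pair up text chunks that occupy matching positions in the markup."""
    a, b = linearize(doc_a), linearize(doc_b)
    matcher = difflib.SequenceMatcher(
        a=[token for token, _ in a],
        b=[token for token, _ in b],
        autojunk=False,
    )
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            text_a, text_b = a[block.a + k][1], b[block.b + k][1]
            if text_a is not None:  # matched tokens are both chunks
                pairs.append((text_a, text_b))
    return pairs
```

For instance, aligning `<h1>Report</h1><p>First point.</p>` with `<h1>Informe</h1><p>Primer punto.</p>` pairs "Report" with "Informe" and "First point." with "Primer punto." without ever inspecting the words themselves.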

Next, we used MXTERMINATOR (Reynar
and Ratnaparkhi, 1997) to break multi-sentence
chunks into sentences in both Spanish
and English. This is a supervised system based on
maximum  entropy  models  that  learns  sentence 
boundaries  from  correctly  boundary-annotated  text. 
Thus  far  we  have  used  a  version  trained  on  English 
text,  and  we  have  found  that  it  performs  reasonably 
well  for  both  Spanish  and  English.  Our  sentence- 
level  alignment  of  the  U.N.  parallel  data  produced 
roughly  300,000  sentences  per  side. 

Tokenization 

Our  ultimate  goal  being  word-level  alignment,  we 
required  tokenized  text.  We  implemented  a  tokenizer 
for  Spanish  using  a  number  of  Perl  pattern  matching 
rules,  some  of  them  adapted  from  the  Spanish 
Kimmo-style  morphological  analyzer  (Dorr,  1993). 
In its current state, this tokenizer removes SGML
tags, bad spacing characters (tabs/spaces/ANSI space/
etc.) and punctuation (periods at the end of a
sentence are instead separated from the preceding
word). It also merges over 2000
frequently co-occurring words that form fixed
expressions, e.g. the tokens in 'dentro de' will be
merged into 'dentrode'. Finally, it performs
morphological analysis. In the case of verbs, it uses
70 Perl substitution rules to make sure that
accentuation patterns and spelling change
appropriately for the resulting verb base form. For
example, the first person singular 'finjo' (I fake)
becomes the infinitive 'fingir' and not *'finjir'. This
tokenizer  has  been  used  in  our  initial  dependency 
tree  inference  experiments  for  Spanish,  described 
below. 
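
The tokenizer steps described above can be sketched as follows. The sketch is in Python rather than Perl, and every pattern, table and function name in it is an assumption made for the example: the real tokenizer has over 2000 fixed expressions and 70 verb rules, none of which are listed in the text.

```python
import re

# ASSUMPTION: a one-entry stand-in for the list of fixed expressions.
FIXED_EXPRESSIONS = {("dentro", "de"): "dentrode"}

# ASSUMPTION: one toy spelling-change rule, mapping first person singular
# -jo to the infinitive -gir. A real rule set would have to restrict this
# to verbs whose infinitive actually ends in -gir.
VERB_RULES = [(re.compile(r"jo$"), "gir")]

def tokenize(sentence):
    """Strip tags and punctuation, split off a final period, merge MWEs."""
    text = re.sub(r"<[^>]+>", " ", sentence)      # drop markup tags
    text = re.sub(r"[,;:!?¡¿\"()]", " ", text)    # drop other punctuation
    text = re.sub(r"\.\s*$", " .", text)          # separate the final period
    words = text.split()                          # normalizes tabs and spaces
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in FIXED_EXPRESSIONS:
            tokens.append(FIXED_EXPRESSIONS[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

def lemmatize_verb(form):
    """Apply the first matching substitution rule to a verb form."""
    for pattern, replacement in VERB_RULES:
        if pattern.search(form):
            return pattern.sub(replacement, form)
    return form
```

Here `tokenize("<p>El gato duerme\tdentro de la casa.</p>")` produces the tokens El, gato, duerme, dentrode, la, casa and a separate final period, while `lemmatize_verb("finjo")` returns "fingir".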

Aligning  Text  at  the  Word  Level 

Once  the  text  has  been  reduced  to  aligned  sentences, 
we  train  IBM  statistical  MT  models  using  software 
developed  by  Al-Onaizan  et  al.  (1999).  The  training 
process produces model parameters and, as a side
effect, it produces the most likely word-level
alignment for each sentence pair in the training
corpus. Preliminary analysis of these alignments is
what  led  us  to  move  from  an  extremely 
unsophisticated  Spanish  tokenizer  to  one  that  takes 
into  account  morphology  and  frequent  multi-word 
co-occurrences. 
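
To make the "alignment as a side effect of training" idea concrete, here is a minimal Python sketch of IBM Model 1, the simplest of the IBM models. It is not the software cited above: it omits the NULL word and all of the higher models, and the function names and toy corpus are assumptions made for the example.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM training for IBM Model 1; returns t[(f, e)] = P(f | e).

    pairs: list of (Spanish tokens, English tokens) sentence pairs.
    """
    vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(vocab))     # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm       # expected alignment count
                    count[(f, e)] += frac
                    total[e] += frac
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t

def viterbi_alignment(fs, es, t):
    """For each Spanish word, the index of its most likely English word."""
    return [max(range(len(es)), key=lambda j: t[(f, es[j])]) for f in fs]

# Toy corpus, invented for the example.
PAIRS = [
    (["el", "gato"], ["the", "cat"]),
    (["el", "perro"], ["the", "dog"]),
    (["gato", "negro"], ["black", "cat"]),
]
```

After training on `PAIRS`, `viterbi_alignment(["gato", "negro"], ["black", "cat"], t)` returns `[1, 0]`, linking ‘gato’ to ‘cat’ and ‘negro’ to ‘black’ even though the word order differs.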

Creating  a  Noisy  Spanish  Treebank 

Statistical methods in NLP have led to major
advances,  with  supervised  training  methods  leading 
the  way  to  the  greatest  improvements  in 
performance  on  tasks  such  as  part-of-speech  tagging, 
syntactic  disambiguation,  and  broad-coverage 
parsing.  Unfortunately,  the  annotated  data  needed 
for  supervised  training  are  available  for  only  a  small 
number  of  languages. 

The  University  of  Maryland  has  recently 
begun  a  project  in  collaboration  with  Johns  Hopkins 
University  aimed  at  breaking  past  this  bottleneck.  A 
central  idea  in  this  effort  is  to  take  advantage  of  the 
rich  resources  available  for  English,  together  with 
parallel  corpora:  the  English  side  of  a  parallel  corpus 
is  annotated  using  existing  tools  and  resources,  and 
the  results  projected  to  the  language  on  the  other 
side  using  word-level  alignments  as  a  bridge;  finally 
supervised  training  is  used  to  create  tools  that 
perform  well  despite  noise  in  the  automatically 
annotated  corpus.  Yarowsky  et  al.  (2001)  have 
shown  extremely  promising  results  of  this 
annotation-projection  technique  for  part-of-speech 
tagging,  named  entities,  and  morphology,  and  at 
Maryland  we  have  been  focusing  on  the  challenges 
of  projecting  syntactic  dependency  relations. 

Figure  1  shows  our  baseline  architecture, 
which  includes  not  only  the  creation  of  a  noisy 
treebank  but  also  its  application  in  an  end-to-end 
machine  translation  process.  Briefly,  a  word-aligned 
parallel  corpus  is  created  as  discussed  in  the 
previous  section.  The  English  side  is  analyzed  using 
Dekang Lin’s Minipar parser (Lin, 1997), which
produces  syntactic  dependencies,  e.g.  indicating 
arguments  of  verbs,  modifiers,  etc.  Crucially,  the 
resulting  dependency  representation  is  independent 
of  word  order. 

Projection of syntactic dependencies relies
on a fairly strong hypothesis: that major
grammatical relations are preserved across
languages. Operationally, the transfer process
begins by assuming that if words e1 and e2 in English
correspond to s1 and s2 in Spanish, respectively, and
there is a dependency relation r between e1 and e2,
then r will hold between s1 and s2. For example,
‘black cat’ in English corresponds to ‘gato negro’ in
Spanish.  Therefore  the  relationship 
adjmod(cat, black)  is  transferred  into  the  Spanish 
analysis  as  adjmod(gato, negro).  Notice  that  the 
relationship  abstracts  away  from  word  order.  These 
resulting  representations  constitute  a  noisy 
dependency  treebank,  which  we  are  using  as  the 
training  set  for  Ratnaparkhi’s  (1997)  MXPOST  POS 
tagger  and  Collins’s  (1997)  stochastic  parser. 
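
The direct transfer step can be sketched in a few lines of Python. The representation is an assumption made for the example: dependencies are (relation, head index, dependent index) triples, the alignment is a one-to-one mapping from English to Spanish token positions, and dependencies with an unaligned word are simply dropped.

```python
def project_dependencies(english_deps, alignment):
    """Project English dependency triples onto the Spanish side.

    english_deps: iterable of (relation, head index, dependent index).
    alignment: dict mapping English token index to Spanish token index.
    """
    projected = []
    for relation, head, dependent in english_deps:
        if head in alignment and dependent in alignment:
            projected.append((relation, alignment[head], alignment[dependent]))
    return projected

def render(deps, tokens):
    """Format triples in the relation(head, dependent) notation of the text."""
    return ["%s(%s, %s)" % (rel, tokens[h], tokens[d]) for rel, h, d in deps]

english = ["black", "cat"]
spanish = ["gato", "negro"]
alignment = {0: 1, 1: 0}           # black -> negro, cat -> gato
english_deps = [("adjmod", 1, 0)]  # adjmod(cat, black)
spanish_deps = project_dependencies(english_deps, alignment)
```

Here `render(spanish_deps, spanish)` gives `adjmod(gato, negro)`: the relation is carried across the alignment links unchanged, and the difference in word order plays no role.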


Figure 1. Baseline Dependency Transfer Architecture


As  stated,  the  hypothesis  of  direct 
dependency  transfer  is  clearly  false  -  indeed,  the 
issue  of  divergences  in  translation  has  been  an 
important  focus  in  our  previous  work  (Dorr,  1993). 
However, we are optimistic that cross-language
correspondence  of  dependencies  is  a  suitable  starting 
point  for  investigation  on  both  theoretical  and 
empirical  grounds.  Theoretically,  grammatical 
relations  are  closer  than  constituency  relations  to  the 
thematic  relationships  underlying  the  sentence 
meaning  common  to  both  sides  of  the  translation 
pair;  thus  the  fundamental  correspondences  are 
likely  to  hold  much  of  the  time.  Moreover,  lexical 
dependencies  have  proven  to  be  instrumental  in 
advances in monolingual syntactic analysis (e.g.
Collins, 1997). These considerations distinguish our
approach from Wu’s (2000) approach, which
characterizes the cross-language syntactic
relationships using a non-lexicalized bilingual
grammar formalism.


prp  vbd  dt  nn  nn  in  prp$  nn
I  got  a  wedding  gift  for  my  brother

nik  nire  anaiari  ezkontza  opari  bat  erosi  nion
I-erg  MY  BROTHER-dat  WEDDING  GIFT  a-abs  BUY-past
prp  prp$  nn  nn  nn  vbd


Figure 2. An example of dependency transfer

Our  second  cause  for  optimism  is  empirical: 
in  preliminary  efforts  we  have  attempted  the  direct 
dependency  transfer  approach  with  Spanish  and 
Chinese,  with  bilingual  speakers  and  linguists 
inspecting  the  results.  The  results  of  dependency 
transfer  look  promising,  and  the  problems  that  are 
evident  so  far  tend  to  be  linguistically  interesting  and 
amenable  to  language-specific  post-transfer 
processing.  As  one  example,  English  parses 
projected  into  Spanish  will  not  lead  to  useful 
dependencies  involving  the  reflexive  se  when,  as  is 
often  the  case,  it  has  no  lexically  realized 
correspondent  on  the  English  side;  post-processing 
of  the  Spanish  can  be  used  to  introduce  a 
dependency  relationship  between  the  verb  and  the 
reflexive  morpheme.  The  use  of  English-side 
information  contrasts  with  the  unsupervised 
dependency-based  translation  models  of  Alshawi  et 
al.  (2000). 

Figure  2  provides  an  illustrative  example 
using  English  and  Basque,  which  have  very  different 
linguistic  properties.  The  figure  shows  that  the  verb- 
subject,  verb-object,  and  modification  relationships 
(most  dependency  labels  suppressed)  transfer 
directly  to  the  Basque  sentence  (a  fluent  translation 
in  neutral  word  order).  The  indirect  object 
relationship  is  expressed  in  the  English  parse  via 
prepositional  modification  between  ‘got’  and  ‘for’, 
together  with  the  relationship  between  ‘for’  and 
‘brother’;  on  the  Basque  side  the  dative  component 
of  meaning  and  the  morpheme  for  ‘brother’  are 
conflated  in  the  word  ‘anaiari’;  the  resulting  pattern 
of  syntactic  dependency  links  on  the  Basque  side  can 
be  post-processed,  with  the  word-internal 
dependency  being  converted  into  a  lexical  feature. 

As  an  important  part  of  our  initial  efforts, 
we  are  developing  rigorous  evaluation  criteria  based 
on  precision  and  recall  of  dependency  triples,  using 
manually  created  dependencies  as  a  gold  standard 
and  using  inter-annotator  precision  and  recall  to 
provide  an  upper  bound. 
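
Precision and recall over dependency triples can be computed as in the Python sketch below. The function name and the toy triples are assumptions made for the example; the text does not specify how partial matches, such as a correct attachment with a wrong label, are to be scored, so this sketch counts exact triple matches only.

```python
def precision_recall(predicted, gold):
    """Score predicted dependency triples against a gold standard.

    Both arguments are collections of (relation, head, dependent) triples.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy triples, invented for the example.
GOLD = {
    ("subj", "come", "gato"),
    ("adjmod", "gato", "negro"),
    ("obj", "come", "pescado"),
}
PREDICTED = {
    ("subj", "come", "gato"),
    ("adjmod", "gato", "negro"),
    ("obj", "come", "negro"),
    ("det", "gato", "el"),
}
```

With these sets, two of the four predicted triples are correct and two of the three gold triples are found, so `precision_recall(PREDICTED, GOLD)` gives a precision of 0.5 and a recall of 2/3. Scoring one annotator against another with the same function gives the inter-annotator figures used as an upper bound.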

Improving  Quality  in  Broad-Coverage  MT 

Analysis and evaluation of MT output from existing
systems (including Systran) reveals that there is a
great deal of work to be done to provide improved
quality. We are currently focusing our efforts on (a)
providing  linguistically  motivated  knowledge  to 
enhance  our  existing  source-language  parsing 
module;  (b)  using  additional  knowledge  about 
divergence  categories  to  improve  on  alignments 
between  source-  and  target-language  dependencies; 
and  (c)  conditioning  statistical  translation 
components,  including  parsing  to  and  generation 
from  dependency  structures,  on  linguistic  features 
not  currently  taken  advantage  of  in  the  traditional 
IBM-style  models. 

As  one  example,  we  take  advantage  of 
semantically  classed  verbs  (Dorr,  1997)  to  capture 
valence  and  other  linguistic  information  to  improve 
parsing  operations  such  as  PP  attachment.  For 
example, all verbs in the class {arrange, immerse,
install, lodge, mount, place, position, put, set,
situate, sling, stash, stow} take a locative
prepositional phrase as an argument; if our training
data contains only the most frequently occurring
verbs in this class (such as ‘put’), we can deduce, by
association, that others (such as ‘sling’) have the
same PP attachment properties, and thus can
improve parsing for these sparsely occurring verbs.
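
One simple way to realize this kind of generalization is to back off from verb-specific statistics to class-level statistics, as in the Python sketch below. The training format, the class label, the majority-vote decision and the default attachment are all assumptions made for the example, not a description of our parsing module.

```python
from collections import Counter

# The verb class from the text, under a hypothetical label.
VERB_CLASSES = {
    "put-locative": {"arrange", "immerse", "install", "lodge", "mount",
                     "place", "position", "put", "set", "situate",
                     "sling", "stash", "stow"},
}
VERB_TO_CLASS = {verb: name
                 for name, verbs in VERB_CLASSES.items() for verb in verbs}

def train(observations):
    """Count PP attachment sites per verb and per verb class.

    observations: iterable of (verb, site) with site "verb" or "noun".
    """
    verb_stats, class_stats = Counter(), Counter()
    for verb, site in observations:
        verb_stats[(verb, site)] += 1
        cls = VERB_TO_CLASS.get(verb)
        if cls is not None:
            class_stats[(cls, site)] += 1
    return verb_stats, class_stats

def predict_attachment(verb, verb_stats, class_stats, default="noun"):
    """Use the verb's own counts if any, else its class's counts."""
    sources = ((verb, verb_stats), (VERB_TO_CLASS.get(verb), class_stats))
    for key, stats in sources:
        to_verb, to_noun = stats[(key, "verb")], stats[(key, "noun")]
        if to_verb + to_noun > 0:
            return "verb" if to_verb >= to_noun else "noun"
    return default

# Toy training data: only the frequent verb 'put' is observed.
OBSERVATIONS = [("put", "verb")] * 3 + [("put", "noun")]
```

After `train(OBSERVATIONS)`, the verb ‘sling’ has no counts of its own, but `predict_attachment("sling", ...)` still returns "verb" because it inherits the statistics gathered for ‘put’ through their shared class.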

As another example, stochastic alignment
algorithms are likely to map the English predicate
‘kick’ to the corresponding French ‘coup’
(especially since the two words also co-occur as
nouns in the absence of ‘donner’), leaving the actual
predicate ‘donner’ unaligned when we generate the
aligned dependency-tree database (to be described in
the next section). This is just one instance of a more
general phenomenon: languages sometimes package
up elements of meaning, particularly verb meaning,
into different constituents than English does (i.e.,
language divergences). To address this issue, we
pre-process  the  English  using  the  semantically 
classed  verbs,  so  that  we  automatically  expand  verbs 
in  selected  divergence  classes  into  alignable 
constituents. 

A  third  example  is  the  use  of  supervised 
word  sense  disambiguation  techniques  in  lexical 
selection.  We  have  developed  a  set  of  tools  for 
supervised  WSD  that  uses  a  combination  of  broad- 
window  and  local  collocational  features  to  represent 
contexts  for  an  ambiguous  word.  A  variety  of 
classification algorithms can be used; we obtained
promising results for English, Spanish, and Swedish
in the recent SENSEVAL-2 evaluation exercise
(Cotton et al., 2001) using support vector machines
for the classification process. We are currently
investigating the adaptation of this method to
perform lexical selection, with the target English
word playing the same role as the sense tag for
Spanish words.
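
The two feature types mentioned above, broad-window and local collocational, can be illustrated with a short Python sketch. The function name, the feature encoding and the window sizes are assumptions made for the example; the resulting feature sets would then be handed to a classifier such as a support vector machine.

```python
def extract_features(tokens, target_index, window=50, local=2):
    """Build a feature set for the ambiguous word at target_index.

    ("bow", w): word w occurs anywhere within `window` tokens.
    ("loc", k, w): word w occurs exactly k positions from the target.
    """
    feats = set()
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            feats.add(("bow", tokens[i].lower()))
    for offset in range(-local, local + 1):
        j = target_index + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.add(("loc", offset, tokens[j].lower()))
    return feats

TOKENS = "I sat on the bank of the river".split()
FEATS = extract_features(TOKENS, TOKENS.index("bank"), window=3, local=1)
```

For the target ‘bank’, the broad-window features record that ‘river’ is nearby, while the local features record that ‘the’ immediately precedes the target and ‘of’ immediately follows it. For lexical selection, the label to be predicted would be the English translation rather than a sense tag.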